OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A General Learning Method for Automatic Title Extraction from HTML Pages

Internal identifier: 000831 (Main/Exploration); previous: 000830; next: 000832

Authors: Sahar Changuel [France]; Nicolas Labroche [France]; Bernadette Bouchon-Meunier [France]

RBID : ISTEX:D4D1E3040C032904E47DC6D9E7209FF37CE927F5

Abstract

This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have pursued similar objectives, but they considered only titles in text format. In this paper we propose a general learning scheme that learns textual titles from style information and image-format titles from image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied and achieve good results despite the heterogeneity of our corpus. We also show that combining both methods can improve performance.

DOI: 10.1007/978-3-642-03070-3_53


The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct:series">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A General Learning Method for Automatic Title Extraction from HTML Pages</title>
<author>
<name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
</author>
<author>
<name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
</author>
<author>
<name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:D4D1E3040C032904E47DC6D9E7209FF37CE927F5</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/978-3-642-03070-3_53</idno>
<idno type="url">https://api.istex.fr/document/D4D1E3040C032904E47DC6D9E7209FF37CE927F5/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000396</idno>
<idno type="wicri:Area/Istex/Curation">000390</idno>
<idno type="wicri:Area/Istex/Checkpoint">000353</idno>
<idno type="wicri:doubleKey">0302-9743:2009:Changuel S:a:general:learning</idno>
<idno type="wicri:Area/Main/Merge">000839</idno>
<idno type="wicri:Area/Main/Curation">000831</idno>
<idno type="wicri:Area/Main/Exploration">000831</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">A General Learning Method for Automatic Title Extraction from HTML Pages</title>
<author>
<name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2009</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">D4D1E3040C032904E47DC6D9E7209FF37CE927F5</idno>
<idno type="DOI">10.1007/978-3-642-03070-3_53</idno>
<idno type="ChapterID">53</idno>
<idno type="ChapterID">Chap53</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have pursued similar objectives, but they considered only titles in text format. In this paper we propose a general learning scheme that learns textual titles from style information and image-format titles from image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied and achieve good results despite the heterogeneity of our corpus. We also show that combining both methods can improve performance.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Île-de-France</li>
</region>
<settlement>
<li>Paris</li>
</settlement>
</list>
<tree>
<country name="France">
<region name="Île-de-France">
<name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
</region>
<name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
<name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
<name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
<name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
<name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000831 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000831 | SxmlIndent | more
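
Where Dilib is not installed, the record can presumably also be read with a general-purpose XML parser. The following is only a minimal sketch in Python, not part of the Dilib toolchain. It assumes a trimmed copy of the record shown earlier on this page, and it binds the `wicri:` prefix to a placeholder namespace URI (`urn:example:wicri`, an assumption made purely so the parser accepts the prefix; the record itself declares none). The names `RECORD` and `summarize` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record from this page. The wicri namespace URI below is
# a placeholder (assumption): the original record uses the prefix undeclared.
RECORD = """<record xmlns:wicri="urn:example:wicri">
  <TEI wicri:istexFullTextTei="biblStruct:series">
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title xml:lang="en">A General Learning Method for Automatic Title Extraction from HTML Pages</title>
          <author><name first="Sahar" last="Changuel">Sahar Changuel</name></author>
          <author><name first="Nicolas" last="Labroche">Nicolas Labroche</name></author>
          <author><name first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name></author>
        </titleStmt>
        <publicationStmt>
          <idno type="RBID">ISTEX:D4D1E3040C032904E47DC6D9E7209FF37CE927F5</idno>
          <idno type="doi">10.1007/978-3-642-03070-3_53</idno>
        </publicationStmt>
      </fileDesc>
    </teiHeader>
  </TEI>
</record>"""


def summarize(record_xml):
    """Return the title, author names, and DOI found in a record string."""
    root = ET.fromstring(record_xml)
    # The first titleStmt holds the main title and the author list.
    stmt = root.find(".//titleStmt")
    title = stmt.findtext("title")
    authors = [name.text for name in stmt.findall("author/name")]
    # Select the idno element whose type attribute is exactly "doi".
    doi = root.findtext(".//publicationStmt/idno[@type='doi']")
    return {"title": title, "authors": authors, "doi": doi}


if __name__ == "__main__":
    info = summarize(RECORD)
    print(info["title"])
    print("; ".join(info["authors"]))
    print(info["doi"])
```

To apply the same function to the full record, the output of the `HfdSelect` command above would first need the `wicri` prefix declared on its root element, since a strict XML parser rejects undeclared prefixes.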

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:D4D1E3040C032904E47DC6D9E7209FF37CE927F5
   |texte=   A General Learning Method for Automatic Title Extraction from HTML Pages
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024